Some analysis of Delhi Pollution Data¶
The datasets for pollution data in many countries are readily available, and a lot of research has already been done on them. I wanted to see if I could do some analysis of the pollution patterns of Indian cities. Unfortunately, I could not find an openly accessible historical pollution dataset for Indian cities. The folks over at aqicn.org apparently provide access to institutions but not to individuals. In any case, I was able to locate a fantastic initiative by the Delhi Pollution Control Committee. They provide raw pollution data from six sensor clusters inside the city. While the availability could be better, and not all sensor clusters cover all the metrics, this kind of data is incredibly useful. Kudos to them for having made it available! One problem is that they do not provide historical data, so I had to aggregate the realtime data over time. What follows is some analysis of that data. Hopefully, as the dataset grows, we'll be able to derive more insights about the pollution situation in Delhi.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import itertools
import re
%matplotlib inline
Munging¶
The data I am dumping out into the csv file below is pretty raw. It looks like:
rawdata = pd.read_csv('./netfile.csv', names=['location', 'metric', 'ts', 'reading', 'guidance'])
rawdata.sort_values(['ts', 'location'])
rawdata.head()
rawdata.location.value_counts()
rawdata.metric.value_counts()
Some of these metrics sound inferable: for example, Nitrogen Dioxide and Nitrogen Oxide should give a good estimate for Oxides of Nitrogen where it does not exist independently. However, we'll look at that later. For now, let's munge this into a more useful dataframe.
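When we do get to it, the inference should amount to little more than summing the two component readings. A minimal sketch with made-up numbers, assuming the NO and NO2 readings share the same units:

```python
import pandas as pd

# toy readings; assumes NO and NO2 are reported in the same units
toy = pd.DataFrame({'metric': ['Nitrogen Dioxide', 'Nitric Oxide'],
                    'reading': [40.0, 25.0]})

# where 'Oxides of Nitrogen' is missing, approximate it as NO + NO2
nox_estimate = toy['reading'].sum()
print(nox_estimate)  # 65.0
```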
rawdata['ts'] = pd.to_datetime(rawdata.ts, unit='s')
def mungeReading(x):
    # keep only the first numeric token in the string; a bare '-' marks a
    # missing reading and becomes a sentinel that to_numeric coerces to NaN
    if x.strip() == '-':
        return "coerce"
    return "".join([t[0] for t in re.findall(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?", x)][:1])
rawdata['reading'] = pd.to_numeric(rawdata.reading.apply(mungeReading), errors='coerce')
rawdata['guidance'] = pd.to_numeric(rawdata.guidance.apply(mungeReading), errors='coerce')
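As a quick sanity check on the parser, here is how it behaves on a few representative raw strings (the function is reproduced so the snippet stands alone):

```python
import re

def mungeReading(x):
    # keep only the first numeric token; a bare '-' marks a missing reading
    # and becomes a string that pd.to_numeric(errors='coerce') turns into NaN
    if x.strip() == '-':
        return "coerce"
    return "".join([t[0] for t in re.findall(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?", x)][:1])

print(mungeReading("12.5 ug/m3"))  # keeps the first number, drops the unit
print(mungeReading("400"))         # plain numbers pass through
print(mungeReading("-"))           # sentinel for missing readings
```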
rawdata.head()
rawdata.reading.isnull().sum(), len(rawdata.reading), rawdata.guidance.isnull().sum(), len(rawdata.guidance)
So that looks reasonable, and only a few readings are NaN. I expect that the guidance is a simple function of the metric and should not change often. Let us check:
rawdata[['metric', 'guidance']].groupby('metric').guidance.value_counts()
Whoa, looks like there are multiple guidances per metric. For 9 records, the guidance for Ammonia, which ought to be 400, was listed as 3.0. Let us check those records.
rawdata[(rawdata['metric'] == 'Ammonia') & (rawdata['guidance'] == 3.0)]
rawdata[(rawdata['metric'] == 'Particulate Matter < 10 µg') & (rawdata['guidance'] == 3.0)]['reading'].value_counts()
rawdata[(rawdata['metric'] == 'Sulphur Dioxide') & (rawdata['guidance'] == 3.0)]['reading'].value_counts()
Looks like these are buggy readings, given that the guidance and reading both settle at 3.0, which is clearly an incorrect guidance for these particular metrics. It is a good guess that all records with both guidance and reading at precisely 3.0 are spurious and should be dropped.
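The drop can be expressed as a single boolean mask. A toy example (made-up rows) of keeping everything except rows where both columns are exactly 3.0:

```python
import pandas as pd

toy = pd.DataFrame({'guidance': [400.0, 3.0, 3.0],
                    'reading':  [12.0, 3.0, 80.0]})

# drop only rows where BOTH guidance and reading are exactly 3.0
spurious = (toy['guidance'] == 3.0) & (toy['reading'] == 3.0)
clean = toy[~spurious]
print(clean)  # the (3.0, 3.0) row is gone; (3.0, 80.0) survives
```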
print(rawdata.shape)
rawdata = rawdata[~((rawdata['guidance'] == 3.0) & (rawdata['reading'] == 3.0))]
rawdata.shape
rawdata[(rawdata['metric'] == 'Particulate Matter < 2.5 µg') & (rawdata['reading'] == 0.0)]
These zero readings also look incorrect. Let's drop any record with a zero guidance or a zero reading as well.
print(rawdata.shape)
rawdata = rawdata[(rawdata['guidance'] != 0.0) & (rawdata['reading'] != 0.0)]
rawdata.shape
rawdata[['metric', 'guidance']].groupby('metric').guidance.value_counts()
I think the data is cleaner now and, as we suspected, there is a single guidance per metric, save for Vertical Wind Speed, which feels fairly reasonable. For some metrics, the guidance is NaN.
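Since the guidance is now essentially one value per metric, it could be captured in a small lookup series before we drop the column. A sketch with made-up rows (the Ammonia value of 400 is from above; the Ozone value here is invented):

```python
import pandas as pd

# toy frame standing in for rawdata: one guidance value per metric
toy = pd.DataFrame({'metric': ['Ammonia', 'Ammonia', 'Ozone', 'Ozone'],
                    'guidance': [400.0, 400.0, 180.0, 180.0]})

# collapse to a metric -> guidance lookup
guidanceTable = toy.groupby('metric')['guidance'].first()
print(guidanceTable['Ammonia'])  # 400.0
```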
rawdata = rawdata[['metric', 'location', 'reading', 'ts']]
Some simple plots¶
Let's take our first look at the data for a few metrics.
fig = plt.figure(figsize=(20,8))
ax1 = fig.add_subplot(311)
ax1.plot(rawdata[rawdata['metric'] == 'Particulate Matter < 2.5 µg']['reading'])
ax2 = fig.add_subplot(312)
ax2.plot(rawdata[rawdata['metric'] == 'Particulate Matter < 10 µg']['reading'])
ax2 = fig.add_subplot(313)
ax2.plot(rawdata[rawdata['metric'] == 'Ozone']['reading'])
I would not have expected the peaks to add up to a constant value. The large flat plateaus seem a bit strange to me, particularly given that this is aggregated data. Of course, the data for different stations is not synchronized from a timestamp perspective, so it could be that the same readings occur at different stations but are staggered.
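One way to test the staggering hypothesis would be to align readings on a shared time axis, one column per station. A sketch on made-up rows (the real frame would need duplicate timestamps handled first, since pivot requires unique index/column pairs):

```python
import pandas as pd

toy = pd.DataFrame({
    'ts': pd.to_datetime([1, 1, 2, 2], unit='s'),
    'location': ['RK Puram', 'Punjabi Bagh', 'RK Puram', 'Punjabi Bagh'],
    'reading': [100.0, 90.0, 110.0, 95.0]})

# one column per station, indexed by timestamp, so staggered readings line up
aligned = toy.pivot(index='ts', columns='location', values='reading')
print(aligned)
```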
metrics = ['Particulate Matter < 2.5 µg', 'Particulate Matter < 10 µg', 'Ozone', 'Nitrogen Dioxide']
locations = ['RK Puram', 'Punjabi Bagh']
fig = plt.figure(figsize=(20,8))
# one row per metric, one column per station
for i, metric in enumerate(metrics):
    for j, location in enumerate(locations):
        ax = fig.add_subplot(4, 2, 2 * i + j + 1)
        ax.plot(rawdata[(rawdata['metric'] == metric) & (rawdata['location'] == location)]['reading'])
There seems to be some periodicity in the PM 10 µg plot for RK Puram, but that does not hold for Punjabi Bagh. I am surprised by the flat plateaus seen in some areas; they could be due to a lack of readings in between. It is, however, strange that the Nitrogen Dioxide metric for Punjabi Bagh stays constant even over a significant stretch of time, particularly given how noisy it has been otherwise. Also, very interestingly, for Punjabi Bagh, the plateaus at the very end line up at the same time for all the metrics. This looks like an issue with the website, which might have returned stale data for a while.
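A simple way to flag such stale plateaus would be to look for long runs of identical consecutive readings. A sketch on a toy series:

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 50.0, 50.0, 50.0, 50.0, 11.0])

# label runs of equal consecutive values, then measure each run's length
run_id = (s != s.shift()).cumsum()
run_len = s.groupby(run_id).transform('size')
stale = run_len >= 4  # values sitting on a plateau of 4+ identical samples
print(s[stale])       # the four consecutive 50.0 readings
```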
def getReadings(metrics='all', location='all'):
    if location == 'all':
        rd = rawdata
    else:
        rd = rawdata[rawdata['location'] == location]
    if metrics == 'all':
        metrics = list(set(rd['metric']))
    # stack the per-metric reading arrays side by side, dropping rows
    # where any metric is missing
    allReadings = [np.array(rd[rd['metric'] == m]['reading']) for m in metrics]
    t = pd.DataFrame(allReadings).transpose().dropna()
    t.columns = metrics
    return t
sn.pairplot(getReadings(location='Punjabi Bagh'), size=3, kind="reg", diag_kind="kde")
This pairplot has a wealth of information. Let's quickly note down the pairs with good linear estimators and low error ranges.
- Ozone vs Relative Humidity
- $CO$ vs p-Xylene
- $NO_2$ and $NO$ vs Oxides of Nitrogen ($NO_x$) (gives a sense of sanity)
- Solar Radiation vs N-oxides (non-linear, but somewhat hyperbolic)
- p-Xylene and Benzene
- p-Xylene and N-oxides
- p-Xylene and Toluene
- Toluene and Benzene
sn.pairplot(getReadings(metrics=['Toluene','Benzene', 'p-Xylene'], location='Punjabi Bagh'), size=3, kind="reg", diag_kind="kde")
Well, the linear relationship is not very strong. I had hoped that since all of these contain a benzene ring, they might be closely related. Interestingly, I think we are seeing quantization in the Benzene values, suggesting that the sensor is imprecise. Just to confirm, let's check another station.
sn.pairplot(getReadings(metrics=['Toluene','Benzene', 'p-Xylene'], location='IGI Airport'), size=3, kind="reg", diag_kind="kde")
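The quantization hunch could also be checked numerically: a crude signal is the ratio of distinct values to sample count, which is low when a sensor reports only a few coarse levels. A sketch on made-up series:

```python
import pandas as pd

# a quantized sensor reports only a few distinct levels
quantized = pd.Series([0.5, 0.5, 1.0, 1.0, 0.5, 1.0, 0.5, 1.0])
smooth = pd.Series([0.51, 0.48, 0.97, 1.03, 0.55, 1.01, 0.49, 0.99])

print(quantized.nunique() / len(quantized))  # low ratio -> coarse steps
print(smooth.nunique() / len(smooth))        # near 1 -> fine-grained
```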
sn.pairplot(getReadings(metrics=['Solar Radiation','Ozone', 'Sulphur Dioxide'], location='Punjabi Bagh'), size=3, kind="reg", diag_kind="kde")
No clear pattern seems to be emerging for this. Let's explore the pairplots for another station.
sn.pairplot(getReadings(location='RK Puram'), size=3, kind="reg")
def getCorrelatedMetrics(threshold, dframe, method='spearman'):
    # pairwise correlations above the threshold, each pair listed once,
    # sorted by correlation in descending order
    df = dframe.corr(method=method)
    indices = np.where(df > threshold)
    indices = [(df.index[x], df.columns[y]) for x, y in zip(*indices)
               if x != y and x < y]
    return sorted([(t, df[t[1]][t[0]]) for t in indices], key=lambda x: x[1], reverse=True)
def getCommonMetrics(locations='all'):
    # metrics reported by every station in the given list
    if locations == 'all':
        locations = list(set(rawdata['location']))
    metricDict = {}
    for location in locations:
        metricDict[location] = set(rawdata[rawdata['location'] == location]['metric'])
    return set.intersection(*metricDict.values())
getCommonMetrics()
getCorrelatedMetrics(0.5, getReadings(location='RK Puram'))
getCorrelatedMetrics(0.5, getReadings(location='Punjabi Bagh'))
plt.plot(rawdata[rawdata['metric'] == 'Nitrogen Dioxide']['reading'])
data = rawdata[['location', 'metric', 'ts', 'reading']].dropna()
data.groupby(['location', 'metric']).describe()